Stylistic Fingerprints, POS-tags, and Inflected Languages: A Case Study in Polish
Abstract
In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was notoriously worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15%.
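The three kinds of style-markers compared in the abstract (word forms, lemmatized forms, and POS-tag n-grams) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the tiny hand-tagged sample, the tag labels, and the helper names `ngrams` and `relative_frequencies` are assumptions made purely for illustration, and a real study would take its annotations from a tagger and lemmatizer run over whole novels.

```python
from collections import Counter


def ngrams(items, n):
    """Return consecutive n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def relative_frequencies(items, top_k=None):
    """Map each item to its share of all items.

    If top_k is given, keep only the top_k most frequent items
    (shares are still computed against the full total).
    """
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.most_common(top_k)}


# Hypothetical hand-tagged sample: (word form, lemma, POS tag).
# "kot widzi psa" / "pies widzi kota": the nouns change form with case.
tagged = [
    ("kot", "kot", "NOUN"), ("widzi", "widzieć", "VERB"), ("psa", "pies", "NOUN"),
    ("pies", "pies", "NOUN"), ("widzi", "widzieć", "VERB"), ("kota", "kot", "NOUN"),
]

forms = [form for form, _, _ in tagged]
lemmas = [lemma for _, lemma, _ in tagged]
tags = [tag for _, _, tag in tagged]

# Three alternative feature sets for the same text.
form_profile = relative_frequencies(forms)                  # raw word forms
lemma_profile = relative_frequencies(lemmas)                # lemmatized forms
pos_bigram_profile = relative_frequencies(ngrams(tags, 2))  # POS-tag bigrams

print(len(form_profile), len(lemma_profile), len(pos_bigram_profile))
```

In this toy sample the six tokens yield five distinct word forms but only three lemmas, which is the sparsity effect of inflection that the abstract describes. Profiles of this kind, computed per text, would then serve as input to a supervised classifier.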
Journal
Journal title: Journal of Quantitative Linguistics
Year: 2022
ISSN: 0929-6174, 1744-5035
DOI: https://doi.org/10.1080/09296174.2022.2122751